The Importance of Proper Weighting Methods

Author

  • Chris Buckley
Abstract

The importance of good weighting methods in information retrieval, methods that stress the most useful features of a document or query representative, is examined. Evidence is presented that good weighting methods are more important than the feature selection process, and it is suggested that the two need to go hand in hand in order to be effective. The paper concludes with a method for learning a good weight for a term based upon the characteristics of that term.

1. INTRODUCTION

Other than experimental results, the first part of this paper contains little new material. Instead, it is an attempt to demonstrate the relative importance of, and the difficulties involved in, the common information retrieval task of forming document and query representatives and weighting features. This is the sort of thing that tends to get passed on by word of mouth, if at all, and never gets published. However, there is a tremendous revival of interest in information retrieval; thus this attempt to help all those new people just starting in experimental information retrieval.

A common approach in many areas of natural language processing is to

1. Find "features" of a natural language excerpt
2. Determine the relative importance of those features within the excerpt
3. Submit the weighted features to some task-appropriate decision procedure

This presentation focuses on the second subtask above: the process of weighting features of a natural language representation. Features here could be things like single word occurrences, phrase occurrences, other relationships between words, occurrence of a word in a title, part of speech of a word, automatically or manually assigned categories of a document, citations of a document, and so on. The particular overall task addressed here is that of information retrieval: finding textual documents (from a large set of documents) that are relevant to a user's information need.
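The three-step approach listed in the introduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the features are single word occurrences, the weighting is a simple tf-idf with length normalization, and the decision procedure is an inner product (cosine) ranking. The excerpt does not specify a weighting scheme, so the function names, the toy documents, and the tf-idf choice are all hypothetical.

```python
import math
from collections import Counter


def find_features(text):
    """Step 1: extract features; here, lowercase single-word occurrences."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def weight_features(features, doc_freq, num_docs):
    """Step 2: weight features. tf-idf is an assumption made for this sketch."""
    counts = Counter(features)
    weights = {
        term: tf * math.log(num_docs / doc_freq[term])
        for term, tf in counts.items()
        if doc_freq.get(term, 0) > 0  # skip terms unseen in the collection
    }
    # Normalize to unit length so long documents are not favored.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


def score(query_vec, doc_vec):
    """Step 3: decision procedure; inner product of the two weighted vectors."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


# Toy collection (hypothetical data, for illustration only).
docs = [
    "weighting methods for information retrieval",
    "feature selection for text classification",
    "urban expansion simulation model",
]
doc_features = [find_features(d) for d in docs]
doc_freq = Counter(t for feats in doc_features for t in set(feats))
doc_vecs = [weight_features(f, doc_freq, len(docs)) for f in doc_features]

query_vec = weight_features(
    find_features("weighting methods retrieval"), doc_freq, len(docs)
)
ranking = sorted(
    range(len(docs)), key=lambda i: score(query_vec, doc_vecs[i]), reverse=True
)
```

Keeping step 2 in its own function makes it possible to swap weighting schemes while holding the feature set fixed, which is the kind of comparison the paper argues for.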
Weighting features is something that many information retrieval systems seem to regard as being of minor importance compared to finding the features in the first place, but the experiments described here suggest that weighting is considerably more important than additional feature selection. This is not an argument that feature selection is unimportant, but that development of feature selection and methods of weighting those features need to proceed hand in hand if there is to be hope of improving performance.

There have been many papers (and innumerable unpublished negative-result experiments) where authors have devoted tremendous resources and intellectual insight to finding good features to help represent a document, but then weighted those features in a haphazard fashion and ended up with little or no improvement. This makes it extremely difficult for a reader to judge the worthiness of a feature approach, especially since the weighting methods are very often not described in detail.

Long term, the best weighting methods will obviously be those that can adapt weights as more information becomes available. Unfortunately, in information retrieval it is very difficult to learn anything useful from one query that will be applicable to the next. In the routing or relevance feedback environments, weights can be learned for a query and then applied to that same query. But in general there is not enough overlap in vocabulary (and uses of vocabulary) between queries to learn much about the usefulness of particular words.

The second half of this paper discusses an approach that learns the important characteristics of a good term. Those characteristics can then be used to properly weight all terms. Several sets of experiments are described, with each set using different types of information to determine the weights of features.
All experiments were done with the SMART information retrieval system, most using the TREC/TIPSTER collections of documents, queries, and relevance judgements. Each run is evaluated using the "11-point recall-precision average" evaluation method that was standard at the TREC 1 conference.
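For readers new to the evaluation measure, the following is a minimal sketch of an 11-point recall-precision average for a single query. It assumes the usual interpolated definition (precision at recall level r is the highest precision observed at any recall of at least r, averaged over r = 0.0, 0.1, ..., 1.0); the excerpt does not spell out the computation, and the function name and inputs are hypothetical.

```python
def eleven_point_average(ranked_ids, relevant_ids):
    """Interpolated 11-point recall-precision average for one query.

    ranked_ids: document ids in retrieval order (assumed free of duplicates).
    relevant_ids: ids judged relevant for the query.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0

    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Interpolate: best precision at any recall >= the current level.
    total = 0.0
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in points if r >= level - 1e-12]
        total += max(candidates, default=0.0)
    return total / 11


# Toy run (hypothetical data): two relevant documents, found at ranks 1 and 3.
result = eleven_point_average(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

In an experiment over many queries, this per-query value would typically be averaged across the query set to give one number per run.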

Similar resources

Developing a model for simulating urban expansion based on the concept of decision risk: A case study in Babol city

Today, the study of the spatial-temporal pattern of urban physical expansion and the identification of the parameters affecting the expansion play a crucial role in urban-related decision-making and long-term planning processes. Consequently, the use of precise and efficient methods to predict the physical expansion of urban areas is of great importance. The objective of the present study is to pro...

A Margin-based Model with a Fast Local Search for Rule Weighting and Reduction in Fuzzy Rule-based Classification Systems

Fuzzy Rule-Based Classification Systems (FRBCS) are widely investigated by researchers due to their noise stability and interpretability. Unfortunately, generating a rule base that is both sufficiently accurate and interpretable is a hard process. Rule weighting is one of the approaches to improve the accuracy of a pre-generated rule base without modifying the original rules. Most of the pro...

Comparison of road safety conditions in Iran with ten Southeast Asian countries using the Road Safety Development Index

Background and aims: Crashes and their related human and financial losses have become one of the challenges of human societies; road death statistics show as many as 219172 deaths in Iran from 2005 to 2014. Methods: In this research, using the global and valid Road Safety Development Index (RSDI), which illustrates general road safety conditions and consists of nine effective par...

Assessment of pollution control technologies by using decision support systems

Background and aims: Reducing air pollution is important for the well-being of people and the environment. Applying an effective and efficient strategy is a key consideration in facing the environmental challenges of managing and controlling air pollution. Recently, the main effort of environmental researchers has been finding low-cost and effective methods to control environmental pollutants. Po...

Design, Analysis and Simulation of a Linear Phase Distributed Amplifier

In this paper a new method for the design of a linear phase distributed amplifier in 180nm CMOS technology is presented. The method is based on analogy between transversal filters and distributed amplifiers topologies. In the proposed method the linearity of the phase at frequency range of 0-50 GHz is obtained by using proper weighting factors for each gain stage in cascaded amplifier topology....

Tailored proper scoring rules elicit decision weights

Proper scoring rules are scoring methods that incentivize honest reporting of subjective probabilities, where an agent strictly maximizes his expected score by reporting his true belief. The implicit assumption behind proper scoring rules is that agents are risk neutral. Such an assumption is often unrealistic when agents are human beings. Modern theories of choice under uncertainty based on ra...

Publication date: 1993